Reshaping data
© 2016, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons
In [ ]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
People who know Excel, probably know the Pivot functionality:
The data of the table:
In [ ]:
excelample = pd.DataFrame({'Month': ["January", "January", "January", "January",
"February", "February", "February", "February",
"March", "March", "March", "March"],
'Category': ["Transportation", "Grocery", "Household", "Entertainment",
"Transportation", "Grocery", "Household", "Entertainment",
"Transportation", "Grocery", "Household", "Entertainment"],
'Amount': [74., 235., 175., 100., 115., 240., 225., 125., 90., 260., 200., 120.]})
In [ ]:
excelample
In [ ]:
excelample_pivot = excelample.pivot(index="Category", columns="Month", values="Amount")
excelample_pivot
Interested in Grand totals?
In [ ]:
# sum columns
excelample_pivot.sum(axis=1)
In [ ]:
# sum rows
excelample_pivot.sum(axis=0)
Small subsample of the titanic dataset:
In [ ]:
df = pd.DataFrame({'Fare': [7.25, 71.2833, 51.8625, 30.0708, 7.8542, 13.0],
'Pclass': [3, 1, 1, 2, 3, 2],
'Sex': ['male', 'female', 'male', 'female', 'female', 'male'],
'Survived': [0, 1, 0, 1, 0, 1]})
In [ ]:
df
In [ ]:
df.pivot(index='Pclass', columns='Sex', values='Fare')
In [ ]:
df.pivot(index='Pclass', columns='Sex', values='Survived')
So far, so good...
Let's now use the full titanic dataset:
In [ ]:
df = pd.read_csv("data/titanic.csv")
In [ ]:
df.head()
And try the same pivot (no worries about the try-except, this is here just used to catch a loooong error):
In [ ]:
try:
df.pivot(index='Sex', columns='Pclass', values='Fare')
except Exception as e:
print("Exception!", e)
This does not work, because we would end up with multiple values for one cell of the resulting frame, as the error says: duplicated values for the columns in the selection. As an example, consider the following rows of our three columns of interest:
In [ ]:
df.loc[[1, 3], ["Sex", 'Pclass', 'Fare']]
Since pivot is just restructuring data, where would both values of Fare for the same combination of Sex and Pclass need to go?
Well, they need to be combined, according to an aggregation functionality, which is supported by the functionpivot_table
In [ ]:
df = pd.read_csv("data/titanic.csv")
In [ ]:
df.pivot_table(index='Sex', columns='Pclass', values='Fare')
In [ ]:
df.pivot_table(index='Sex', columns='Pclass',
values='Fare', aggfunc='max')
In [ ]:
df.pivot_table(index='Sex', columns='Pclass',
values='Fare', aggfunc='count')
In [ ]:
pd.crosstab(index=df['Sex'], columns=df['Pclass'])
In [ ]:
# %load snippets/06 - Reshaping data20.py
In [ ]:
# %load snippets/06 - Reshaping data21.py
In [ ]:
# %load snippets/06 - Reshaping data22.py
In [ ]:
# %load snippets/06 - Reshaping data23.py
The melt function performs the inverse operation of a pivot. This can be used to make your frame longer, i.e. to make a tidy version of your data.
In [ ]:
pivoted = df.pivot_table(index='Sex', columns='Pclass', values='Fare').reset_index()
pivoted.columns.name = None
In [ ]:
pivoted
Assume we have a DataFrame like the above. The observations (the average Fare people payed) are spread over different columns. In a tidy dataset, each observation is stored in one row. To obtain this, we can use the melt function:
In [ ]:
pd.melt(pivoted)
As you can see above, the melt function puts all column labels in one column, and all values in a second column.
In this case, this is not fully what we want. We would like to keep the 'Sex' column separately:
In [ ]:
pd.melt(pivoted, id_vars=['Sex']) #, var_name='Pclass', value_name='Fare')
The docs say:
Pivot a level of the (possibly hierarchical) column labels, returning a DataFrame (or Series in the case of an object with a single level of column labels) having a hierarchical index with a new inner-most level of row labels.
Indeed...
Before we speak about hierarchical index, first check it in practice on the following dummy example:
In [ ]:
df = pd.DataFrame({'A':['one', 'one', 'two', 'two'],
'B':['a', 'b', 'a', 'b'],
'C':range(4)})
df
To use stack/unstack, we need the values we want to shift from rows to columns or the other way around as the index:
In [ ]:
df = df.set_index(['A', 'B']) # Indeed, you can combine two indices
df
In [ ]:
result = df['C'].unstack()
result
In [ ]:
df = result.stack().reset_index(name='C')
df
To better understand and reason about pivot tables, we can express this method as a combination of more basic steps. In short, the pivot is a convenient way of expressing the combination of a groupby and stack/unstack.
In [ ]:
df = pd.read_csv("data/titanic.csv")
In [ ]:
df.head()
In [ ]:
df.pivot_table(index='Pclass', columns='Sex',
values='Survived', aggfunc='mean')
In [ ]:
# %load snippets/06 - Reshaping data37.py
Like the pivot table above, we can now also obtain the result of melt with stack/unstack.
Let's use the same pivoted frame as above, and look at the final melt result:
In [ ]:
pivoted = df.pivot_table(index='Sex', columns='Pclass', values='Fare').reset_index()
pivoted.columns.name = None
pivoted
In [ ]:
pd.melt(pivoted, id_vars=['Sex'], var_name='Pclass', value_name='Fare')
In [ ]:
# %load snippets/06 - Reshaping data40.py
In [ ]:
# %load snippets/06 - Reshaping data41.py
In [ ]:
# %load snippets/06 - Reshaping data42.py
In [ ]:
# %load snippets/06 - Reshaping data43.py
These exercises are based on the PyCon tutorial of Brandon Rhodes (so credit to him!) and the datasets he prepared for that. You can download these data from here: titles.csv and cast.csv and put them in the /data folder.
In [ ]:
cast = pd.read_csv('data/cast.csv')
cast.head()
In [ ]:
titles = pd.read_csv('data/titles.csv')
titles.head()
In [ ]:
# %load snippets/06 - Reshaping data46.py
In [ ]:
# %load snippets/06 - Reshaping data47.py
In [ ]:
# %load snippets/06 - Reshaping data48.py
In [ ]:
# %load snippets/06 - Reshaping data49.py
In [ ]:
# %load snippets/06 - Reshaping data50.py
In [ ]:
# %load snippets/06 - Reshaping data51.py
In [ ]:
# %load snippets/06 - Reshaping data52.py